Introduction

FIFA Dataset

Chapter 1: Introduction

FIFA 19 is a football (soccer) simulation video game developed by EA Sports. It is part of the FIFA series, which has been produced for over 20 years. A new FIFA game is released every year; FIFA 19 was released in 2018, at the beginning of the 2018-2019 season of the major soccer leagues in Europe. FIFA 19 has over 31 leagues and more than 720 playable teams from around the world. The game contains an enormous amount of data describing each player, ranging from age and nationality to skills such as finishing, kicking, heading, tackling, and even weak-foot strength.

How does EA Sports compile such a huge database of ratings for every single player in all the licensed leagues? It employs a team of 25 EA Producers and 400 outside data contributors, led by the Head of Data Collection & Licensing. This team is responsible for ensuring that all player data is up to date, while a community of over 6,000 FIFA Data Reviewers, or Talent Scouts, from all over the world constantly provides suggestions and alterations to the database.

In this project, our team extracts several insights from the dataset using exploratory data analysis (EDA) and other statistical methods.

Sources:
https://www.ea.com/games/fifa
https://www.fifplay.com/fifa-19-leagues-and-teams/
https://www.goal.com/en-ae/news/fifa-player-ratings-explained-how-are-the-card-number-stats/1hszd2fgr7wgf1n2b2yjdpgynu

Chapter 2: Description of Data

2.1 Source Data

In this study, the CSV data file comes from the FIFA 2019 database on Kaggle (https://www.kaggle.com/karangadiya/fifa19). The dataset contains information on 18,207 soccer players in FIFA 2019, with 89 variables such as name, age, nationality, skill level, potential, club, transfer value, wage, preferred foot, body type, position on the field, height, and weight.

2.2 Geographic Coverage of Data

As a basic data visualization, we use the latitude and longitude in the dataset to draw maps, which are plotted with the leaflet library in R.

The number of soccer players in each area of the world

The average skill level of soccer players in each area of the world (level: 0-100)

Smart Question 1

How can we predict a dream team by determining the optimal skills (strength, heading, kick, jump) and characteristics (left-footed, body type, vision) for certain positions? And who would be the best players for each position?

Player Positions, Values, and Skills

Q1: How did the value of players vary according to the positions they play?

Q2: What skills make strikers valuable?

Smart Question 2

What variables contribute most to a player’s skill in penalty shots?

To begin the analysis, we drop all character variables and remove rows with missing values. Additionally, we drop the ID variable, as it provides no usable information for our analysis.
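The cleaning step described above can be sketched in base R as follows. This is only an illustration on a tiny hypothetical table: the `players` data frame and its values are made up, and the original report may have used different functions.

```r
# Hypothetical miniature of the player table (values are illustrative only).
players <- data.frame(
  ID = c(1L, 2L, 3L, 4L),
  Name = c("A", "B", "C", "D"),
  Age = c(31, 33, 26, NA),
  Penalties = c(75, 85, 81, 40),
  ShotPower = c(85, 95, 80, 31),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns (this drops character variables such as Name).
is_num <- vapply(players, is.numeric, logical(1))
fifa <- players[, is_num, drop = FALSE]

# Drop the identifier, then drop any row that still contains a missing value.
fifa$ID <- NULL
fifa <- na.omit(fifa)

str(fifa)
```

Dropping columns before calling `na.omit()` matters: rows that are missing only in a discarded column are then kept.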

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0    39.0    49.0    48.5    60.0    92.0
## [1] 15.7

##                 Penalties Aggression     Age Stamina ShotPower Balance
## Penalties          1.0000      0.336  0.1394  0.5164     0.795   0.483
## Aggression         0.3364      1.000  0.2652  0.6460     0.492   0.185
## Age                0.1394      0.265  1.0000  0.0979     0.157  -0.090
## Stamina            0.5164      0.646  0.0979  1.0000     0.616   0.475
## ShotPower          0.7952      0.492  0.1570  0.6164     1.000   0.459
## Balance            0.4828      0.185 -0.0900  0.4749     0.459   1.000
## BallControl        0.7699      0.550  0.0851  0.7286     0.831   0.601
## LongPassing        0.5427      0.591  0.1817  0.6358     0.672   0.462
## HeadingAccuracy    0.5520      0.693  0.1472  0.6346     0.612   0.169
## Strength           0.0545      0.474  0.3333  0.2628     0.169  -0.391
##                 BallControl LongPassing HeadingAccuracy Strength
## Penalties            0.7699       0.543           0.552   0.0545
## Aggression           0.5500       0.591           0.693   0.4739
## Age                  0.0851       0.182           0.147   0.3333
## Stamina              0.7286       0.636           0.635   0.2628
## ShotPower            0.8314       0.672           0.612   0.1692
## Balance              0.6009       0.462           0.169  -0.3908
## BallControl          1.0000       0.789           0.658   0.0878
## LongPassing          0.7887       1.000           0.511   0.1143
## HeadingAccuracy      0.6582       0.511           1.000   0.4869
## Strength             0.0878       0.114           0.487   1.0000

The EDA process begins by subsetting the data set to include only the variables of interest. We keep Penalties as our dependent variable and include 9 explanatory variables. The summary statistics for Penalties indicate that players’ penalty skill ranges from a low of 5.0 to a high of 92.0, with a mean of 48.5, a median of 49.0, and a standard deviation of 15.7. The correlation matrix suggests that most variables are correlated in some way with other variables, and the graphical display of the correlation matrix visually confirms these correlations. The data may be appropriate for linear regression, but the high correlations between some variables may indicate multicollinearity.
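A minimal sketch of how such summary statistics and a correlation matrix can be produced in base R is shown below. The data are synthetic and the name `fifa_demo` is hypothetical; only the column names echo the report.

```r
set.seed(1)
n <- 200

# Synthetic ratings: BallControl is built to correlate with ShotPower.
shot_power <- rnorm(n, mean = 55, sd = 15)
ball_control <- 0.8 * shot_power + rnorm(n, mean = 10, sd = 8)
penalties <- 0.5 * shot_power + 0.4 * ball_control + rnorm(n, sd = 8)

fifa_demo <- data.frame(
  Penalties = penalties,
  ShotPower = shot_power,
  BallControl = ball_control
)

# Five-number summary plus mean, and the standard deviation.
summary(fifa_demo$Penalties)
sd(fifa_demo$Penalties)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(fifa_demo)
round(cor_mat, 3)
```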

## 
## Call:
## lm(formula = Penalties ~ ., data = fifa)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.00  -5.15   0.31   5.50  47.81 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.76672    0.64153    1.20     0.23    
## Aggression      -0.16083    0.00599  -26.87  < 2e-16 ***
## Age              0.32554    0.01471   22.13  < 2e-16 ***
## Stamina         -0.05354    0.00671   -7.98  1.6e-15 ***
## ShotPower        0.44502    0.00671   66.36  < 2e-16 ***
## Balance          0.08893    0.00705   12.62  < 2e-16 ***
## BallControl      0.39634    0.01003   39.51  < 2e-16 ***
## LongPassing     -0.12904    0.00722  -17.87  < 2e-16 ***
## HeadingAccuracy  0.17451    0.00645   27.04  < 2e-16 ***
## Strength        -0.05908    0.00761   -7.76  8.8e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.42 on 18137 degrees of freedom
## Multiple R-squared:  0.713,  Adjusted R-squared:  0.713 
## F-statistic: 5e+03 on 9 and 18137 DF,  p-value: <2e-16
##      Aggression             Age         Stamina       ShotPower 
##            2.77            1.21            2.92            3.42 
##         Balance     BallControl     LongPassing HeadingAccuracy 
##            2.54            7.17            3.14            3.22 
##        Strength 
##            2.34

We build a multiple linear regression model in which we regress Penalties on Aggression, Age, Stamina, ShotPower, Balance, BallControl, LongPassing, HeadingAccuracy, and Strength. The intercept estimate of 0.77 is the predicted Penalties rating when every explanatory variable equals zero, and it is not statistically significant at any conventional level. Interpreting it is not meaningful here: age is an explanatory variable, and an age of zero is not realistic; moreover, the explanatory variables in the data set do not take on values of zero. The insignificance of the intercept is therefore not a concern. The coefficients on all explanatory variables are statistically significant at the 0.1 percent level, which suggests that every variable in the regression contributes to predicting a player’s penalty skill. All VIF values are below the commonly used threshold of 10, which suggests no severe multicollinearity, although BallControl comes closest with a value of 7.17. The model has an R-squared of 0.713, meaning that it explains about 71 percent of the variance in the dependent variable. The histogram of the regression residuals suggests that the residuals are close to normally distributed. Overall, the multiple regression model fits well and predicts a player’s skill in penalty shots reasonably accurately (residual standard error of 8.42 rating points).
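To make the VIF values concrete, the sketch below fits a regression on synthetic data and computes each VIF directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The variables `x1`, `x2`, `x3` are hypothetical, and this hand computation stands in for whatever VIF function the report actually called.

```r
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.4)   # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others
y <- 2 + 1.5 * x1 + 0.5 * x2 - x3 + rnorm(n)
demo <- data.frame(y, x1, x2, x3)

# Multiple linear regression of y on all other columns.
fit <- lm(y ~ ., data = demo)
summary(fit)$r.squared

# VIF for each predictor: regress it on the remaining predictors.
predictors <- setdiff(names(demo), "y")
vif <- vapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = demo))$r.squared
  1 / (1 - r2)
}, numeric(1))

round(vif, 2)
```

In this toy data the correlated pair `x1` and `x2` gets clearly larger VIFs than the independent `x3`, which is the same pattern BallControl shows relative to Age in the output above.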

## Importance of components:
##                          PC1   PC2    PC3    PC4    PC5    PC6    PC7
## Standard deviation     2.274 1.360 0.9360 0.8211 0.6344 0.5581 0.4845
## Proportion of Variance 0.517 0.185 0.0876 0.0674 0.0402 0.0312 0.0235
## Cumulative Proportion  0.517 0.702 0.7896 0.8571 0.8973 0.9284 0.9519
##                           PC8    PC9    PC10
## Standard deviation     0.4611 0.4108 0.31511
## Proportion of Variance 0.0213 0.0169 0.00993
## Cumulative Proportion  0.9732 0.9901 1.00000

Although the multiple linear regression model performs well, we investigate PCA/PCR as a way to reduce the dimensions of the model. The output for the scaled data displays the principal components and their summary statistics. Four principal components are enough to explain about 86 percent of the variation in the data set, while six principal components explain about 93 percent. The PCA graph displays the proportion of variance explained by each principal component.
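The proportion-of-variance table above can be reproduced from any `prcomp()` fit. The sketch below uses synthetic data (the `skills` data frame is hypothetical) and shows how the cumulative proportion determines how many components are needed to reach a target such as 85 percent.

```r
set.seed(7)
n <- 300
base <- rnorm(n)

# Four synthetic ratings; a and b share a common factor, c and d are noise.
skills <- data.frame(
  a = base + rnorm(n, sd = 0.3),
  b = base + rnorm(n, sd = 0.3),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on centered and scaled variables.
pca <- prcomp(skills, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(var_explained)
round(rbind(var_explained, cumulative), 3)

# Smallest number of components whose cumulative proportion reaches 85 percent.
k85 <- which(cumulative >= 0.85)[1]
k85
```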

## Data:    X dimension: 14517 9 
##  Y dimension: 14517 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            15.7    11.15    10.67    10.57    8.927    8.542    8.502
## adjCV         15.7    11.15    10.67    10.57    8.926    8.541    8.502
##        7 comps  8 comps  9 comps
## CV       8.503     8.49    8.456
## adjCV    8.503     8.49    8.455
## 
## TRAINING: % variance explained
##            1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X            51.30    71.26    80.84    86.18    90.50    93.92    96.54
## Penalties    49.61    53.86    54.68    67.70    70.42    70.70    70.70
##            8 comps  9 comps
## X            98.88   100.00
## Penalties    70.80    71.03

## [1] "---------MSE Linear Regression---------"
## [1] 70.8
## [1] "---------MSE PCR ---------"
## [1] 69.9
## Data:    X dimension: 18147 9 
##  Y dimension: 18147 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##            1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X            51.28    71.28    80.84    86.19    90.50    93.93
## Penalties    49.87    54.19    54.92    67.64    70.59    70.90

We decide to build a PCA model and perform principal component regression (PCR). Ultimately, we want to compare the MSE of the multiple linear regression model with the MSE of the PCR model and choose the model with the lower MSE. We split the data into training and testing sets using an 80/20 split and fit the PCR model on the training set using 10-fold cross-validation. The output from the PCR fit suggests that the RMSEP levels off at 6 principal components (8.502); adding the remaining components lowers it only marginally, to 8.456 with all 9. The validation plot displays the RMSEP for each number of components. We then compare the two models: the MSE of the linear regression (70.8) is higher than the MSE of the PCR (69.9). Because the PCR with 6 principal components performs well and has a lower MSE than linear regression, we build a PCR model with 6 principal components and suggest that PCR is the better model.
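The comparison described above can be sketched with base R alone. The output in this report comes from a PCR fit with the `svdpc` method and cross-validation; the version below instead builds PCR by hand from `prcomp()` and `lm()` on synthetic data, with a fixed number of components `k` chosen purely for illustration, so it shows the mechanics rather than the report's exact procedure.

```r
set.seed(123)
n <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # correlated with x1
x3 <- rnorm(n)
dat <- data.frame(y = 1 + x1 + 0.5 * x2 + 0.2 * x3 + rnorm(n), x1, x2, x3)

# 80/20 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Baseline: ordinary least squares, evaluated on the held-out rows.
ols <- lm(y ~ ., data = train)
mse_ols <- mean((test$y - predict(ols, newdata = test))^2)

# PCR: PCA on the training predictors only, then regress y on the first k scores.
k <- 2  # hypothetical choice; the report selects it from the RMSEP table
pca <- prcomp(train[, -1], center = TRUE, scale. = TRUE)
train_scores <- data.frame(y = train$y, pca$x[, 1:k, drop = FALSE])
pcr_fit <- lm(y ~ ., data = train_scores)

# Project the test predictors with the training rotation before predicting.
test_scores <- data.frame(predict(pca, newdata = test[, -1])[, 1:k, drop = FALSE])
mse_pcr <- mean((test$y - predict(pcr_fit, newdata = test_scores))^2)

c(OLS = mse_ols, PCR = mse_pcr)
```

Both MSE values are computed on the same held-out rows, which is what makes the comparison fair; fitting the PCA on the training rows only avoids leaking test information into the components.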

Smart Question 3

What variables contribute the most to a goalkeeper’s potential: age, physical factors, current goalkeeping (GK) skills, etc.?

  1. Residuals vs Fitted

This plot shows whether the residuals have non-linear patterns.

The residuals are spread equally around a horizontal line without distinct patterns, so there is no sign of a non-linear relationship (good).

  2. Normal Q-Q

This plot shows whether the residuals are normally distributed.

The residuals lie quite well along the straight line, so they are approximately normally distributed.

  3. Scale-Location

This plot shows whether the residuals are spread equally across the range of fitted values, which checks the assumption of equal variance (homoscedasticity).

The points are spread equally (randomly) around a horizontal line (good).

  4. Residuals vs Leverage

This plot helps us find influential outliers.

No outliers appear to be influential.

Smart Question 4

How do specific skill sets and characteristics affect a player’s wage? How about value? Can we say that skills are good predictors of wage and value?

First, we cleaned and formatted the Wage and Value columns by removing the euro (€) sign. We then needed to remove the non-numeric characters “M” and “K”. To keep the data meaningful, we created an M column and a K column for both Wage and Value and copied the corresponding entries into them before removing the characters, so that only digits and decimal points remained. We then multiplied the M column by 1,000,000 and the K column by 1,000, and merged the results back into the Wage and Value columns, now stored as numeric values ready for operations and functions. We also converted the integer skill variables to numeric and omitted NA values in case any were present.
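The same conversion can be done in a single pass. The helper below, `parse_money()`, is a hypothetical name and is not the code used in this project, which worked through separate M and K columns; it assumes strings shaped like "€110.5M", "€565K", or "€0".

```r
# "\u20ac" is the euro sign, written as an escape so the script is encoding-safe.
euro <- "\u20ac"

parse_money <- function(x) {
  # Remove the currency symbol.
  clean <- gsub(euro, "", x, fixed = TRUE)
  # Whatever follows the leading digits and decimal point is the suffix ("M", "K", or "").
  suffix <- sub("^[0-9.]+", "", clean)
  # Numeric part with the suffix removed.
  number <- as.numeric(sub("[KM]$", "", clean))
  # Scale millions and thousands; plain numbers are left unchanged.
  multiplier <- ifelse(suffix == "M", 1e6, ifelse(suffix == "K", 1e3, 1))
  number * multiplier
}

values <- parse_money(paste0(euro, c("110.5M", "565K", "0")))
values
```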

Structure of the Datasets

## 'data.frame':    18159 obs. of  35 variables:
##  $ Wage           : num  565000 405000 290000 260000 355000 340000 420000 455000 380000 94000 ...
##  $ Crossing       : num  84 84 79 17 93 81 86 77 66 13 ...
##  $ Finishing      : num  95 94 87 13 82 84 72 93 60 11 ...
##  $ HeadingAccuracy: num  70 89 62 21 55 61 55 77 91 15 ...
##  $ ShortPassing   : num  90 81 84 50 92 89 93 82 78 29 ...
##  $ Volleys        : num  86 87 84 13 82 80 76 88 66 13 ...
##  $ Dribbling      : num  97 88 96 18 86 95 90 87 63 12 ...
##  $ Curve          : num  93 81 88 21 85 83 85 86 74 13 ...
##  $ FKAccuracy     : num  94 76 87 19 83 79 78 84 72 14 ...
##  $ LongPassing    : num  87 77 78 51 91 83 88 64 77 26 ...
##  $ BallControl    : num  96 94 95 42 91 94 93 90 84 16 ...
##  $ Acceleration   : num  91 89 94 57 78 94 80 86 76 43 ...
##  $ SprintSpeed    : num  86 91 90 58 76 88 72 75 75 60 ...
##  $ Agility        : num  91 87 96 60 79 95 93 82 78 67 ...
##  $ Reactions      : num  95 96 94 90 91 90 90 92 85 86 ...
##  $ Balance        : num  95 70 84 43 77 94 94 83 66 49 ...
##  $ ShotPower      : num  85 95 80 31 91 82 79 86 79 22 ...
##  $ Jumping        : num  68 95 61 67 63 56 68 69 93 76 ...
##  $ Stamina        : num  72 88 81 43 90 83 89 90 84 41 ...
##  $ Strength       : num  59 79 49 64 75 66 58 83 83 78 ...
##  $ LongShots      : num  94 93 82 12 91 80 82 85 59 12 ...
##  $ Aggression     : num  48 63 56 38 76 54 62 87 88 34 ...
##  $ Interceptions  : num  22 29 36 30 61 41 83 41 90 19 ...
##  $ Positioning    : num  94 95 89 12 87 87 79 92 60 11 ...
##  $ Vision         : num  94 82 87 68 94 89 92 84 63 70 ...
##  $ Penalties      : num  75 85 81 40 79 86 82 85 75 11 ...
##  $ Composure      : num  96 95 94 68 88 91 84 85 82 70 ...
##  $ Marking        : num  33 28 27 15 68 34 60 62 87 27 ...
##  $ StandingTackle : num  28 31 24 21 58 27 76 45 92 12 ...
##  $ SlidingTackle  : num  26 23 33 13 51 22 73 38 91 18 ...
##  $ GKDiving       : num  6 7 9 90 15 11 13 27 11 86 ...
##  $ GKHandling     : num  11 11 9 85 13 12 9 25 8 92 ...
##  $ GKKicking      : num  15 15 15 87 5 6 7 31 9 78 ...
##  $ GKPositioning  : num  14 14 15 88 10 8 14 33 7 88 ...
##  $ GKReflexes     : num  8 11 11 94 13 8 9 37 11 89 ...
## 'data.frame':    18159 obs. of  35 variables:
##  $ Value          : num  1.10e+08 7.70e+07 1.18e+08 7.20e+07 1.02e+08 ...
##  $ Crossing       : num  84 84 79 17 93 81 86 77 66 13 ...
##  $ Finishing      : num  95 94 87 13 82 84 72 93 60 11 ...
##  $ HeadingAccuracy: num  70 89 62 21 55 61 55 77 91 15 ...
##  $ ShortPassing   : num  90 81 84 50 92 89 93 82 78 29 ...
##  $ Volleys        : num  86 87 84 13 82 80 76 88 66 13 ...
##  $ Dribbling      : num  97 88 96 18 86 95 90 87 63 12 ...
##  $ Curve          : num  93 81 88 21 85 83 85 86 74 13 ...
##  $ FKAccuracy     : num  94 76 87 19 83 79 78 84 72 14 ...
##  $ LongPassing    : num  87 77 78 51 91 83 88 64 77 26 ...
##  $ BallControl    : num  96 94 95 42 91 94 93 90 84 16 ...
##  $ Acceleration   : num  91 89 94 57 78 94 80 86 76 43 ...
##  $ SprintSpeed    : num  86 91 90 58 76 88 72 75 75 60 ...
##  $ Agility        : num  91 87 96 60 79 95 93 82 78 67 ...
##  $ Reactions      : num  95 96 94 90 91 90 90 92 85 86 ...
##  $ Balance        : num  95 70 84 43 77 94 94 83 66 49 ...
##  $ ShotPower      : num  85 95 80 31 91 82 79 86 79 22 ...
##  $ Jumping        : num  68 95 61 67 63 56 68 69 93 76 ...
##  $ Stamina        : num  72 88 81 43 90 83 89 90 84 41 ...
##  $ Strength       : num  59 79 49 64 75 66 58 83 83 78 ...
##  $ LongShots      : num  94 93 82 12 91 80 82 85 59 12 ...
##  $ Aggression     : num  48 63 56 38 76 54 62 87 88 34 ...
##  $ Interceptions  : num  22 29 36 30 61 41 83 41 90 19 ...
##  $ Positioning    : num  94 95 89 12 87 87 79 92 60 11 ...
##  $ Vision         : num  94 82 87 68 94 89 92 84 63 70 ...
##  $ Penalties      : num  75 85 81 40 79 86 82 85 75 11 ...
##  $ Composure      : num  96 95 94 68 88 91 84 85 82 70 ...
##  $ Marking        : num  33 28 27 15 68 34 60 62 87 27 ...
##  $ StandingTackle : num  28 31 24 21 58 27 76 45 92 12 ...
##  $ SlidingTackle  : num  26 23 33 13 51 22 73 38 91 18 ...
##  $ GKDiving       : num  6 7 9 90 15 11 13 27 11 86 ...
##  $ GKHandling     : num  11 11 9 85 13 12 9 25 8 92 ...
##  $ GKKicking      : num  15 15 15 87 5 6 7 31 9 78 ...
##  $ GKPositioning  : num  14 14 15 88 10 8 14 33 7 88 ...
##  $ GKReflexes     : num  8 11 11 94 13 8 9 37 11 89 ...

HISTOGRAM

We log-transformed Wage and Value because both were heavily skewed to the right. After the log transformation, the histograms look closer to normal, although Wage remains right-skewed. We also tried the Tukey transformation, which is stronger than log(), but it was limited to about 5,000 observations, and we have more than 18,000 observations. This is one of our limitations.
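The effect of the log transformation on skewness can be checked numerically as well as visually. The sketch below uses synthetic log-normal values in place of the real Wage column and a small hand-written `skewness()` helper (a hypothetical name, not from any package).

```r
set.seed(5)

# Synthetic right-skewed "wages"; all values are strictly positive.
wage <- rlnorm(1000, meanlog = 8, sdlog = 1)

# Sample skewness: third central moment scaled by the cubed standard deviation.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(wage)
skew_log <- skewness(log(wage))
c(raw = skew_raw, log = skew_log)

# Histograms before and after the transformation.
hist(wage, breaks = 30, main = "Raw")
hist(log(wage), breaks = 30, main = "Log-transformed")
```

Note that `log()` returns `-Inf` for zeros, so any zero wages or values would need to be removed or handled separately before transforming.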

4.1 Multiple Regression

To answer the question, we ran multiple regressions of Wage and of Value on the skill variables; the results are shown below. For each model, we list the five skills with the largest coefficients.

Wage-Skills

## 
## Call:
## lm(formula = Wage ~ ., data = data_wage_skills)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37769  -8208  -2634   4177 510805 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -86754.2     1889.8  -45.91  < 2e-16 ***
## Crossing           -29.8       18.6   -1.61  0.10809    
## Finishing           68.0       23.4    2.91  0.00367 ** 
## HeadingAccuracy    151.8       19.6    7.74  1.1e-14 ***
## ShortPassing       162.7       33.3    4.89  1.0e-06 ***
## Volleys            101.6       20.3    5.01  5.5e-07 ***
## Dribbling          100.9       29.2    3.45  0.00055 ***
## Curve               47.9       20.0    2.39  0.01676 *  
## FKAccuracy         -35.2       17.7   -1.99  0.04660 *  
## LongPassing        -70.7       24.4   -2.89  0.00380 ** 
## BallControl        209.5       35.8    5.86  4.7e-09 ***
## Acceleration        60.4       27.6    2.18  0.02899 *  
## SprintSpeed         83.6       25.6    3.26  0.00112 ** 
## Agility           -107.6       20.5   -5.25  1.6e-07 ***
## Reactions          682.6       27.0   25.26  < 2e-16 ***
## Balance             -7.2       18.5   -0.39  0.69646    
## ShotPower           25.3       20.8    1.22  0.22267    
## Jumping             18.7       14.5    1.29  0.19702    
## Stamina            -86.2       16.8   -5.13  2.9e-07 ***
## Strength           -31.9       17.4   -1.83  0.06757 .  
## LongShots         -125.2       22.1   -5.67  1.5e-08 ***
## Aggression         -48.0       15.4   -3.11  0.00186 ** 
## Interceptions      -62.4       22.3   -2.79  0.00523 ** 
## Positioning        -89.1       21.8   -4.09  4.3e-05 ***
## Vision              70.8       20.2    3.50  0.00046 ***
## Penalties           32.7       19.2    1.70  0.08853 .  
## Composure          212.4       21.8    9.76  < 2e-16 ***
## Marking             73.7       17.9    4.13  3.7e-05 ***
## StandingTackle      82.2       33.4    2.46  0.01373 *  
## SlidingTackle       31.5       30.9    1.02  0.30873    
## GKDiving           174.0       41.7    4.18  3.0e-05 ***
## GKHandling         176.8       42.2    4.19  2.7e-05 ***
## GKKicking           66.5       38.9    1.71  0.08743 .  
## GKPositioning      -19.8       41.2   -0.48  0.63035    
## GKReflexes          97.6       41.4    2.36  0.01828 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18600 on 18124 degrees of freedom
## Multiple R-squared:  0.287,  Adjusted R-squared:  0.286 
## F-statistic:  214 on 34 and 18124 DF,  p-value: <2e-16

Top 5 Skills that Explain Wage: Reactions - 682.6, Composure - 212.4, Ball Control - 209.5, GK Handling - 176.8, GK Diving - 174.0

Value-Skills

## 
## Call:
## lm(formula = Value ~ ., data = data_value_skills)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
##  -9596403  -2109062   -698085   1114023 103926217 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -24282460     460873  -52.69  < 2e-16 ***
## Crossing           -29294       4524   -6.47  9.8e-11 ***
## Finishing           29982       5708    5.25  1.5e-07 ***
## HeadingAccuracy     35257       4786    7.37  1.8e-13 ***
## ShortPassing        56542       8115    6.97  3.3e-12 ***
## Volleys             24798       4944    5.02  5.3e-07 ***
## Dribbling           33102       7122    4.65  3.4e-06 ***
## Curve                9670       4879    1.98  0.04749 *  
## FKAccuracy           5202       4314    1.21  0.22780    
## LongPassing        -10367       5956   -1.74  0.08177 .  
## BallControl         52684       8719    6.04  1.5e-09 ***
## Acceleration        35510       6742    5.27  1.4e-07 ***
## SprintSpeed         25128       6254    4.02  5.9e-05 ***
## Agility            -34474       5002   -6.89  5.7e-12 ***
## Reactions          208049       6590   31.57  < 2e-16 ***
## Balance             -7438       4501   -1.65  0.09845 .  
## ShotPower             592       5068    0.12  0.90708    
## Jumping             -1521       3539   -0.43  0.66727    
## Stamina              -236       4095   -0.06  0.95395    
## Strength           -11327       4251   -2.66  0.00771 ** 
## LongShots          -38616       5389   -7.17  8.1e-13 ***
## Aggression         -19917       3765   -5.29  1.2e-07 ***
## Interceptions      -26809       5447   -4.92  8.6e-07 ***
## Positioning        -31442       5313   -5.92  3.3e-09 ***
## Vision              22478       4926    4.56  5.1e-06 ***
## Penalties          -12013       4676   -2.57  0.01020 *  
## Composure           53780       5304   10.14  < 2e-16 ***
## Marking             22743       4355    5.22  1.8e-07 ***
## StandingTackle      37817       8133    4.65  3.3e-06 ***
## SlidingTackle      -15713       7538   -2.08  0.03713 *  
## GKDiving            44531      10160    4.38  1.2e-05 ***
## GKHandling          35418      10280    3.45  0.00057 ***
## GKKicking           19481       9487    2.05  0.04004 *  
## GKPositioning       -2880      10050   -0.29  0.77442    
## GKReflexes          27251      10084    2.70  0.00689 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4540000 on 18124 degrees of freedom
## Multiple R-squared:  0.344,  Adjusted R-squared:  0.343 
## F-statistic:  280 on 34 and 18124 DF,  p-value: <2e-16

Top 5 Skills that Explain Value: Reactions - 208049, Short Passing - 56542, Composure - 53780, Ball Control - 52684, GK Diving - 44531

Result and Findings from the Multiple Regression

Interestingly, the skills with the largest coefficients for both wage and value are Reactions, Composure, Ball Control, and GK Diving; all four appear in the top five for both models. By coefficient size, these skills are the strongest predictors of wage and value. Beyond these four, GK Handling completes the top five for wage, and Short Passing completes the top five for a player’s value.

4.2 Analysis of the Models

We checked whether the models satisfy the assumptions of linear regression by plotting the residuals and examining the VIF values.

Wage Plot of the Residuals

Value Plot of the Residuals

Findings from the Plots

We reviewed the plots, and the Wage and Value models show similar results. These are our observations:

Residuals vs Fitted - We want the residuals scattered around a horizontal line, with no non-linear pattern. In this case the trend is somewhat curved, but the pattern is close to linear considering the size of the dataset (acceptable).

QQ plot - We want the residual points to stay close to the line. In this case, they deviate from the line severely (not good).

Scale-Location - We want the residuals spread randomly along the line. In this case they are not: they are concentrated on the left and spread out toward the right (not good).

Residuals vs Leverage - We want all points to fall inside the Cook’s distance contours (the dashed lines). In this case, the dashed lines are not visible in the plot, so we cannot tell whether any points fall outside them.
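When the Cook’s distance contours are not visible in the Residuals vs Leverage plot, the distances can be inspected numerically instead. The sketch below does this on a synthetic model (the data and the model `fit` are hypothetical); by default, `plot.lm()` draws its contours at 0.5 and 1, so those values are used as reference points here.

```r
set.seed(99)
n <- 200
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# Largest distance, and the observations beyond the 0.5 contour (if any).
max(cd)
flagged <- which(cd > 0.5)
length(flagged)

# Leverage (hat values) can be inspected the same way.
summary(hatvalues(fit))
```

If the largest Cook’s distance is far below 0.5, the contours fall outside the plotting region, which is one reason the dashed lines may not appear.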

Wage VIF

##        Crossing       Finishing HeadingAccuracy    ShortPassing 
##            6.08           10.94            6.09           12.54 
##         Volleys       Dribbling           Curve      FKAccuracy 
##            6.74           15.98            7.10            5.01 
##     LongPassing     BallControl    Acceleration     SprintSpeed 
##            7.34           18.65            8.92            7.39 
##         Agility       Reactions         Balance       ShotPower 
##            4.81            3.11            3.57            6.72 
##         Jumping         Stamina        Strength       LongShots 
##            1.54            3.73            2.51            9.49 
##      Aggression   Interceptions     Positioning          Vision 
##            3.77           11.20            9.49            4.28 
##       Penalties       Composure         Marking  StandingTackle 
##            4.75            3.24            6.62           27.35 
##   SlidingTackle        GKDiving      GKHandling       GKKicking 
##           22.69           28.48           26.62           21.59 
##   GKPositioning      GKReflexes 
##           25.82           28.88

Value VIF

##        Crossing       Finishing HeadingAccuracy    ShortPassing 
##            6.08           10.94            6.09           12.54 
##         Volleys       Dribbling           Curve      FKAccuracy 
##            6.74           15.98            7.10            5.01 
##     LongPassing     BallControl    Acceleration     SprintSpeed 
##            7.34           18.65            8.92            7.39 
##         Agility       Reactions         Balance       ShotPower 
##            4.81            3.11            3.57            6.72 
##         Jumping         Stamina        Strength       LongShots 
##            1.54            3.73            2.51            9.49 
##      Aggression   Interceptions     Positioning          Vision 
##            3.77           11.20            9.49            4.28 
##       Penalties       Composure         Marking  StandingTackle 
##            4.75            3.24            6.62           27.35 
##   SlidingTackle        GKDiving      GKHandling       GKKicking 
##           22.69           28.48           26.62           21.59 
##   GKPositioning      GKReflexes 
##           25.82           28.88

MULTICOLLINEARITY

We appear to have multicollinearity, since many of the VIFs are well above 5 and several exceed 20. The independent variables are highly correlated with each other, and these values suggest that the corresponding coefficients are imprecisely estimated.

Smart Question 5

How do physical factors (age, preferred foot, overall skill, international reputation, position on the field, weight, and height) help soccer professionals earn a wage higher than 100,000 euros per week?

7.2 Exploratory Data Analysis

In this section, we perform exploratory data analysis (EDA) on several numerical variables in the dataset. After assessing some of the characteristics of those variables, we decided to clean the data by omitting outliers and NA values.

7.2.1. Data transformation and Descriptive Statistics

Data transformation

In this chapter, we use logistic regression with 7 variables: age, skill level, wage, preferred foot, position, height, and weight. At this point, the data have not yet been cleaned.

The original data are below:

After transforming some variables, we obtained the new data, with 4 numeric variables (age, skill level, height, and weight) and 3 categorical variables (Wage_dummy, position, and preferred foot).
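As an illustration of the wage transformation, a binary indicator like Wage_dummy can be derived from a numeric wage column as sketched below. The wage values are made up, and the rule "strictly above 100,000 euros" is an assumption based on the wording of the research question, not a confirmed detail of the original code.

```r
# Hypothetical weekly wages in euros.
wage <- c(565000, 405000, 94000, 1000, 100000)

# 1 if the wage is strictly above 100,000 euros, 0 otherwise (assumed rule).
wage_dummy <- factor(ifelse(wage > 100000, 1, 0), levels = c(0, 1))

# Class balance of the new outcome variable.
table(wage_dummy)
```

Storing the indicator as a factor with both levels declared keeps the contingency tables and the later logistic regression consistent even if one class happens to be empty in a subset.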

The new data are below:

Descriptive Statistics

The descriptive statistics are presented below:

According to the above statistics, the mean and median of each of the 4 numeric variables are very close. However, there are many missing values (NAs) in height, weight, position, and preferred foot.

7.2.2. Normality (Q-Q plots)

In this step, we check normality for the 4 numeric variables: age, skill level, height, and weight.

According to the above Q-Q plots, these four numeric variables are not far from normally distributed. However, the tails in each Q-Q plot indicate that the data contain outliers. Therefore, the outliers need to be removed before the analysis in the next step.

7.3 Pre-Logistic Regression

Does the data support that “Age”, “Preferred foot”, “Overall skill level”, “Preferred position”, “Weight”, and “Height” affect “Wage”?

7.3.1 Age VS Wage

To study the effect of age (a quantitative variable) on wage (a categorical variable), we create a two-way contingency table of the outcome and the predictor.

Chi squared test

We can then quickly run a chi-squared test to see if the two are independent (or same frequency distribution).

## Warning in chisq.test(wageagetable): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  wageagetable
## X-squared = 46, df = 22, p-value = 0.002

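From the small p-value of 0.002, we reject the null hypothesis that “Wage” and “Age” are independent. Therefore, the data support the claim that “Age” may have an effect on “Wage”. Note that the warning indicates that some expected cell counts are small, so the chi-squared approximation should be treated with caution.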
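The contingency-table workflow used throughout this section can be sketched as follows. The data are synthetic, and the names `age_group` and `high_wage` are hypothetical stand-ins for the report's variables. The last step shows one option when R warns that the chi-squared approximation may be incorrect: a Monte Carlo p-value via `simulate.p.value = TRUE`.

```r
set.seed(2024)
n <- 600

# Synthetic predictor with three age bands and a binary wage outcome.
age_group <- sample(c("<=23", "24-29", ">=30"), n, replace = TRUE)
high_wage <- rbinom(n, 1, ifelse(age_group == "24-29", 0.35, 0.15))

# Two-way contingency table of outcome by predictor.
tab <- table(HighWage = high_wage, AgeGroup = age_group)
tab

# Pearson chi-squared test of independence.
chi <- chisq.test(tab)
chi$expected   # expected counts; values below about 5 trigger the warning
chi$p.value

# Alternative when expected counts are small: simulated p-value.
chi_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
chi_sim$p.value
```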

7.3.2 Preferred foot VS Wage

To study the effect of preferred foot on wage (both categorical variables), we create a two-way contingency table of the outcome and the predictor.

Chi squared test

We can then quickly run a chi-squared test to see if the two are independent (or same frequency distribution).

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  wagefoottable
## X-squared = 0.1, df = 1, p-value = 0.7

From the large p-value of 0.729, we fail to reject the null hypothesis that “Wage” and “Foot” are independent. Therefore, the data provide no evidence that “Preferred foot” has an effect on “Wage”.

7.3.3 Wage VS Overall skill level

To study the effect of skill level (a quantitative variable) on wage (a categorical variable), we create a two-way contingency table of the outcome and the predictor.

Chi squared test

We can then quickly run a chi-squared test to see if the two are independent (or same frequency distribution).

## Warning in chisq.test(wageskilltable): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  wageskilltable
## X-squared = 4061, df = 35, p-value <2e-16

From the very small p-value (less than 2e-16), we reject the null hypothesis that “Wage” and “Skill level” are independent. Therefore, the data support the view that “Skill level” may have an effect on “Wage”.

7.3.4 Wage VS Height

To study the effect of height on wage (a quantitative predictor and a categorical outcome, respectively), we can create a two-way contingency table of the outcome and the predictor, with height treated as a set of discrete levels.

Chi squared test

We can then run a chi-squared test to see whether the two variables are independent (that is, whether the wage distribution is the same at every level of the predictor).

## Warning in chisq.test(wageheitable): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  wageheitable
## X-squared = 10, df = 16, p-value = 0.9

From the large p-value of 0.891, we fail to reject the null hypothesis that “Wage” and “Height” are independent. Therefore, the data do not provide evidence that “Height” has an effect on “Wage”.

7.3.5 Wage VS Weight

To study the effect of weight on wage (a quantitative predictor and a categorical outcome, respectively), we can create a two-way contingency table of the outcome and the predictor, with weight treated as a set of discrete levels.

Chi squared test

We can then run a chi-squared test to see whether the two variables are independent (that is, whether the wage distribution is the same at every level of the predictor).

## Warning in chisq.test(wageweitable): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  wageweitable
## X-squared = 35, df = 40, p-value = 0.7

From the large p-value of 0.685, we fail to reject the null hypothesis that “Wage” and “Weight” are independent. Therefore, the data do not provide evidence that “Weight” has an effect on “Wage”.

7.3.6 Wage VS Preferred position in the field

To study the effect of position on wage (two categorical variables), we can create a two-way contingency table of the outcome and the predictor.

Chi squared test

We can then run a chi-squared test to see whether the two variables are independent (that is, whether the wage distribution is the same at every level of the predictor).

## 
##  Pearson's Chi-squared test
## 
## data:  wagepositable
## X-squared = 7, df = 3, p-value = 0.07

With a p-value of 0.071, we reject the null hypothesis that “Wage” and “Position” are independent at the 10% significance level, but not at the 5% or 1% level. Therefore, the data provide only weak evidence that “Position” might have an effect on “Wage”.

7.4 Logistic Regression

Let us now turn our attention to the logistic regression model. We fit a model for the wage indicator (Wage_dummy) with age, preferred foot, overall skill level, preferred position in the field, weight, and height as predictors.

We can see the summary of the logistic regression model here:

## 
## Call:
## glm(formula = Wage_dummy ~ Age + Preferred.Foot + Overall + Position + 
##     Weight + Height, family = "binomial", data = fifaNew)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.368  -0.012  -0.003  -0.001   3.225  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -59.27247    5.82873  -10.17   <2e-16 ***
## Age                  -0.05785    0.03387   -1.71    0.088 .  
## Preferred.FootRight  -0.02402    0.27749   -0.09    0.931    
## Overall               0.71730    0.05174   13.86   <2e-16 ***
## PositionDEF           1.42780    0.59407    2.40    0.016 *  
## PositionMID           1.35141    0.60210    2.24    0.025 *  
## PositionFWD           1.49305    0.62545    2.39    0.017 *  
## Weight               -0.00570    0.01281   -0.45    0.656    
## Height                0.00184    0.03025    0.06    0.952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1184.30  on 17848  degrees of freedom
## Residual deviance:  506.66  on 17840  degrees of freedom
## AIC: 524.7
## 
## Number of Fisher Scoring iterations: 11
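As a rough illustration of how the fitted model turns player attributes into a probability, the Python sketch below plugs the coefficients from the summary into the logistic function. The two example players are hypothetical, the function name is ours, and the label "GK" for the goalkeeper baseline is an assumption about how that level is coded. Weight is assumed to be in lbs and height in centimeters, as in the interpretation that follows.

```python
import math

# Coefficients copied from the glm summary above
COEF = {
    "(Intercept)": -59.27247,
    "Age": -0.05785,
    "Preferred.FootRight": -0.02402,
    "Overall": 0.71730,
    "PositionDEF": 1.42780,
    "PositionMID": 1.35141,
    "PositionFWD": 1.49305,
    "Weight": -0.00570,
    "Height": 0.00184,
}


def high_wage_probability(age, right_footed, overall, position, weight, height):
    """Predicted probability that Wage_dummy = 1 for one player."""
    # Linear predictor (log-odds): intercept plus coefficient * value for each term
    eta = (COEF["(Intercept)"] + COEF["Age"] * age + COEF["Overall"] * overall
           + COEF["Weight"] * weight + COEF["Height"] * height)
    if right_footed:                 # left foot is the baseline
        eta += COEF["Preferred.FootRight"]
    if position != "GK":             # goalkeeper is the baseline ("GK" label assumed)
        eta += COEF["Position" + position]
    # The inverse logit maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-eta))


# Two hypothetical forwards who differ only in overall skill level
p_overall_90 = high_wage_probability(27, True, 90, "FWD", 160, 180)
p_overall_70 = high_wage_probability(27, True, 70, "FWD", 160, 180)
print(p_overall_90, p_overall_70)
```

For these made-up players the predicted probability is about 0.99 at an overall rating of 90 and well below 0.001 at a rating of 70, which shows how strongly the overall skill level drives the prediction.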

Model interpretation:

The age effect on wage

According to the regression results above, the coefficient on age indicates that for each additional year of age, the log-odds of earning more than 100,000 Euro per week decrease by 0.06 (the odds are multiplied by about 0.94). In other words, holding the other variables fixed, younger players are more likely to earn more than 100k Euro per week than older players. However, with a p-value of 0.088, this effect is statistically significant only at the 10% level, not at the 5% level.

The preferred foot effect on wage

For preferred foot, with left foot as the baseline, the log-odds of earning more than 100,000 Euro per week decrease by 0.02 when changing from left to right foot (the odds are multiplied by about 0.98). So left-footed players appear slightly more likely to earn more than 100,000 Euro per week than right-footed players. However, the effect is not statistically significant (p = 0.931).

The skill level effect on wage

The coefficient on skill level indicates that for each one-point increase in overall skill, the log-odds of earning more than 100,000 Euro per week increase by 0.72 (the odds roughly double, being multiplied by about 2.05). In other words, high-skill players are more likely to earn more than 100k Euro per week than low-skill players. This effect is highly statistically significant (p < 2e-16).

The effect of position of the player in the field on wage

For position, with Goalkeeper as the baseline, the log-odds of earning more than 100,000 Euro per week increase by 1.43 when changing from Goalkeeper to Defender, by 1.35 when changing from Goalkeeper to Midfielder, and by 1.49 when changing from Goalkeeper to Forward. Holding the other variables fixed, outfield players are therefore more likely than goalkeepers to earn more than 100k Euro per week. All three position effects are statistically significant at the 5% level.

The weight effect on wage

The coefficient on weight indicates that for each one-pound (lb) increase in weight, the log-odds of earning more than 100,000 Euro per week decrease by 0.006. In other words, lighter players appear slightly more likely to earn more than 100k Euro per week than heavier players. However, the effect is not statistically significant (p = 0.656).

The height effect on wage

The coefficient on height indicates that for each one-centimeter (cm) increase in height, the log-odds of earning more than 100,000 Euro per week increase by 0.002, since the estimate (0.00184) is positive. In other words, taller players appear very slightly more likely to earn more than 100k Euro per week than shorter players. However, the effect is not statistically significant (p = 0.952).
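The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The Python sketch below does this for every term in the summary and also recomputes the Wald z statistic and two-sided p-value from the estimate and standard error. The 95% interval uses the simple Wald approximation (estimate ± 1.96 standard errors), which is our choice for illustration and not something reported in the output.

```python
import math

# (estimate, standard error) pairs copied from the glm summary above
TERMS = {
    "Age": (-0.05785, 0.03387),
    "Preferred.FootRight": (-0.02402, 0.27749),
    "Overall": (0.71730, 0.05174),
    "PositionDEF": (1.42780, 0.59407),
    "PositionMID": (1.35141, 0.60210),
    "PositionFWD": (1.49305, 0.62545),
    "Weight": (-0.00570, 0.01281),
    "Height": (0.00184, 0.03025),
}


def summarize(estimate, std_error):
    """Odds ratio, Wald z, two-sided p-value and Wald 95% interval for one term."""
    z = estimate / std_error
    return {
        "odds_ratio": math.exp(estimate),
        "z": z,
        # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
        "p_value": math.erfc(abs(z) / math.sqrt(2)),
        "ci_95": (math.exp(estimate - 1.96 * std_error),
                  math.exp(estimate + 1.96 * std_error)),
    }


results = {name: summarize(*pair) for name, pair in TERMS.items()}
for name, r in results.items():
    print(f"{name:20s} OR={r['odds_ratio']:.3f}  z={r['z']:.2f}  p={r['p_value']:.3f}")
```

An odds ratio above 1 means the term raises the odds of the high-wage outcome, and one below 1 means it lowers them. An interval that contains 1 corresponds to a term that is not significant at the 5% level.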

7.5 Model evaluation

7.5.1. ROC curve and AUC

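The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds, and the Area Under the Curve (AUC) summarizes the curve in a single number. The AUC lies between 0 and 1, where 0.5 corresponds to random guessing and 1 to perfect discrimination. Values higher than 0.8 are generally considered to indicate a good model fit.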
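One way to build intuition for the AUC is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The Python sketch below computes it that way by comparing all pairs. The labels and scores are hypothetical and are not model output from this study.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = high wage) and predicted probabilities
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
toy_auc = auc(labels, scores)
print(toy_auc)
```

The pairwise loop is quadratic in the number of cases, so it is meant for illustration only; dedicated ROC packages use faster rank-based formulas.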

The result is shown here:

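We have an area under the curve of 0.991, which is higher than 0.8. This indicates that the model discriminates very well between the two wage groups. Note that the AUC says nothing about the significance of individual coefficients; as the model summary in Section 7.4 shows, only some of them are statistically significant.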

7.5.2. McFadden

McFadden's pseudo-R-squared is another evaluation tool we can use for logistic regression. It belongs to the family of pseudo-R-squared measures.

In this case, the McFadden value is 0.572. It is analogous to the coefficient of determination \(R^2\): loosely speaking, about 57.2% of the variation in y is explained by the explanatory variables in the model.
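The McFadden value can be checked directly from the deviances in the model summary. For a binary (0/1) response, the deviance equals −2 times the log-likelihood, so McFadden's measure reduces to one minus the ratio of residual to null deviance. The short Python sketch below reproduces the 0.572 figure and, as a consistency check, the reported AIC; the variable names are ours.

```python
# Values copied from the glm summary in Section 7.4
null_deviance = 1184.30
residual_deviance = 506.66
n_coefficients = 9  # intercept plus eight slope terms

# McFadden pseudo-R-squared = 1 - logLik(model) / logLik(null)
# which, for a 0/1 response, equals 1 - residual deviance / null deviance
mcfadden_r2 = 1 - residual_deviance / null_deviance

# AIC = -2 * logLik + 2 * number of coefficients
aic = residual_deviance + 2 * n_coefficients

print(round(mcfadden_r2, 3), round(aic, 1))
```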